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Abstract 

Variable selection plays an important role in high dimensional statistical modeling which nowa- 
days appears in many areas and is key to various scientific discoveries. For problems of large scale 
or dimensionality p, estimation accuracy and computational cost are two top concerns. In a recent 
paper, Candes and Tao (2007) propose the Dantzig selector using Li regularization and show that it 
achieves the ideal risk up to a logarithmic factor log p. Their innovative procedure and remarkable 
result are challenged when the dimensionality is ultra high as the factor logp can be large and their 
uniform uncertainty principle can fail. 

Motivated by these concerns, we introduce the concept of sure screening and propose a sure 
screening method based on a correlation learning, called the Sure Independence Screening (SIS), to 
reduce dimensionality from high to a moderate scale that is below sample size. In a fairly general 
asymptotic framework, the correlation learning is shown to have the sure screening property for even 
exponentially growing dimensionality. As a methodological extension, an iterative SIS (ISIS) is also 
proposed to enhance its finite sample performance. With dimension reduced accurately from high to 
below sample size, variable selection can be improved on both speed and accuracy, and can then be 
accomplished by a well-developed method such as the SCAD, Dantzig selector. Lasso, or adaptive 
Lasso. The connections of these penalized least-squares methods are also elucidated. 
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1 Introduction 



1.1 Background 

Consider the problem of estimating a p- vector of parameters /3 from the linear model 

y = X(3 + e, (1) 

where y = (li, • • • , Yn)'^ is an n-vector of responses, X = (xi, • • • , x„)^ is an n x p random design 
matrix with i.i.d. xi, • • • , x„, P = • • • , /3p)^ is a p-vector of parameters, and e = {ei, ■ ■ ■ , En)^ 
is an n-vector of i.i.d. random errors. When dimension p is high, it is often assumed that only a small 
number of predictors among Xi , • • • , Xp contribute to the response, which amounts to assuming ideally 
that the parameter vector (3 is sparse. With sparsity, variable selection can improve estimation accuracy 
by effectively identifying the subset of important predictors, and also enhance model interpretability 
with parsimonious representation. 

Sparsity comes frequently with high dimensional data, which is a growing feature in many areas 
of contemporary statistics. The problems arise frequently in genomics such as gene expression and 
proteomics studies, biomedical imaging, functional MRI, tomography, tumor classifications, signal pro- 
cessing, image analysis, and finance, where the number of variables or parameters p can be much larger 
than sample size n. For instance, one may wish to classify tumors using microarray gene expression 
or proteomics data; one may wish to associate protein concentrations with expression of genes or pre- 
dict certain clinical prognosis (e.g., injury scores or survival time) using gene expression data. For this 
kind of problems, the dimensionality can be much larger than the sample size, which calls for new or 
extended statistical methodologies and theories. See, e.g., Donoho (2000) and Fan and Li (2006) for 
overviews of statistical challenges with high dimensionality. 

Back to the problem in ([T]), it is challenging to find tens of important variables out of thousands 
of predictors, with number of observations usually in tens or hundreds. This is similar- to finding a 
couple of needles in a huge haystack. A new idea in Candes and Tao (2007) is the notion of uniform 
uncertainty principle (UUP) on deterministic design matrices. They proposed the Dantzig selector, 
which is the solution to an ^i-regularization problem, and showed that under UUP, this minimum 
estimator achieves the ideal risk, i.e., the risk of the oracle estimator with the true model known ahead of 
time, up to a logarithmic factor log p. Appealing features of the Dantzig selector include: 1) it is easy to 
implement because the convex optimization the Dantzig selector solves can easily be recast as a linear 
program; and 2) it has the oracle property in the sense of Donoho and Johnstone (1994). 

Despite their remarkable achievement, we still have four concerns when the Dantzig selector is 
applied to high or ultra-high dimensional problems. First, a potential hurdle is the computational cost 
for large or huge scale problems such as implementing linear programs in dimension tens or hundreds 
of thousands. Second, the factor \ogp can become large and may not be negligible when dimension p 
grows rapidly with sample size n. Third, as dimensionality grows, their UUP condition may be hard 
to satisfy, which will be illustrated later using a simulated example. Finally, there is no guarantee the 
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Dantzig selector picks up the right model though it has the oracle property. These four concerns inspire 
our work. 

1.2 Dimensionality reduction 

Dimension reduction or feature selection is an effective strategy to deal with high dimensionality. With 
dimensionality reduced from high to low, computational burden can be reduced drastically. Meanwhile, 
accurate estimation can be obtained by using some well-developed lower dimensional method. Moti- 
vated by this along with those concerns on the Dantzig selector, we have the following main goal in our 
paper: 

• Reduce dimensionality p from a large or huge scale (say, exp(0(?i^)) for some ^ > 0) to a 
relatively large scale d (e.g., o(n)) by a fast and efficient method. 

We achieve this by introducing the concept of sure screening and proposing a sure screening method 
based on a con^elation learning which filters out the features that have weak correlation with the re- 
sponse. Such a correlation screening is called Sure Independence Screening (SIS). Here and below, 
by sure screening we mean a property that all the important variables survive after variable screening 
with probability tending to one. This dramatically narrows down the search for important predictors. In 
particular, applying the Dantzig selector to the much smaller submodel relaxes our first concern on the 
computational cost. In fact, this not only speeds up the Dantzig selector, but also reduces the logarithmic 
factor in mimicking the ideal risk from log p to log d, which is smaller than log n and hence relaxes our 
second concern above. It also addresses he third concern since the UUP condition is easier to satisfy. 

Oracle properties in a stronger sense, say, mimicking the oracle in not only selecting the right model, 
but also estimating the parameters efficiently, give a positive answer to our third and fourth concerns 
above. Theories on oracle properties in this sense have been developed in the literature. Fan and Li 
(2001) lay down groundwork on variable selection problems in the finite parameter setting. They dis- 
cussed a family of variable selection methods that adopt a penalized likelihood approach, which includes 
well-established methods such as the AIC and BIC, as well as more recent methods like the bridge re- 
gression in Frank and Friedman (1993), Lasso in Tibshirani (1996), and SCAD in Fan (1997) and Anto- 
niadis and Fan (2001), and established oracle properties for nonconcave penalized likelihood estimators. 
Later on. Fan and Peng (2004) extend the results to the setting of p = o(n^/^) and show that the oracle 
properties continue to hold. An effective algorithm for optimizing penalized likelihood, local quadratic 
approximation (LQA), was proposed in Fan and Li (2001) and well studied in Hunter and Li (2005). 
Zou (2006) introduces an adaptive Lasso in a finite parameter setting and shows that Lasso does not 
have oracle properties as conjectured in Fan and Li (2001), whereas the adaptive Lasso does. Zou and 
Li (2008) propose a local linear approximation algorithm that recasts the computation of non-concave 
penalized likelihood problems into a sequence of penalized Li -likelihood problems. They also proposed 
and studied the one-step sparse estimators for nonconcave penalized likelihood methods. 
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There is a huge literature on the problem of variable selection. To name a few in addition to those 
mentioned above, Fan and Li (2002) study variable selection for Cox's proportional hazards model and 
frailty model; Efron, Hastie, Johnstone and Tibshirani (2004) propose LARS; Hunter and Li (2005) 
propose a new class of algorithms, MM algorithms, for variable selection; Meinshausen and Biihlmann 
(2006) look at the problem of variable selection with the Lasso for high dimensional graphs, and Zhao 
and Yu (2006) give an almost necessary and sufficient condition on model selection consistency of 
Lasso. Meier, van de Geer and Biihlmann (2008) proposed a fast implementation for group Lasso. More 
recent studies include Huang, Horowitz and Ma (2008), Paul et al. (2007), Zhang (2007), and Zhang 
and Huang (2008), which signficantly advances the theory and methods of the penalized least-squares 
approaches. It is worth to mention that in variable selection, there is a weaker concept than consistency, 
called persistency, introduced by Greenshtein and Ritov (2004). Motivation of this concept lies in the 
fact that in machine learning such as tumor classifications, the primary interest centers on the misclassi- 
fication errors or more generally expected losses, not the accuracy of estimated parameters. Greenshtein 
and Ritov (2004) study the persistency of Lasso-type procedures in high dimensional linear predictor se- 
lection, and Greenshtein (2006) extends the results to more general loss functions. Meinshausen (2007) 
considers a case with finite nonsparsity and shows that under quadratic loss. Lasso is persistent, but the 
rate of persistency is slower than that of a relaxed Lasso. 

1.3 Some insight on high dimensionality 

To gain some insight on challenges of high dimensionaUty in variable selection, let us look at a situation 
where all the predictors Xi, - ■ ■ , Xp are standardized and the distribution of z = Xl~^/^x is spherically 
symmetric, where x = [Xi, • • • , Xpj^ and S = cov (x). Clearly, the transformed predictor vector z has 
covariance matrix Ip. Our way of study in this paper is to separate the impacts of the covariance matrix 
S and the distribution of z, which gives us a better understanding on difficulties of high dimensionality 
in variable selection. 

The real difficulty when dimension p is larger than sample size n comes from four facts. First, the 
design matrix X is rectangular, having more columns than rows. In this case, the matrix X^X is huge 
and singular. The maximum spurious con^elation between a covariate and the response can be large (see, 
e.g.. Figure 1) because of the dimensionality and the fact that an unimportant predictor can be highly 
correlated with the response variable due to the presence of important predictors associated with the pre- 
dictor. These make variable selection difficult. Second, the population covariance matrix S may become 
ill-conditioned as n grows, which adds difficulty to variable selection. Third, the minimum nonzero ab- 
solute coefficient may decay with n and get close to the noise level, say, the order (logp/n)~^/^. 
Fourth, the distribution of z may have heavy tails. Therefore, in general, it is challenging to estimate the 
sparse parameter vector (3 accurately when p ^ n. 

When dimension p is large, some of the intuition might not be accurate. This is exemplified by 
the data piling problems in high dimensional space observed in Hall, Marron and Neeman (2005). A 
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Maximum absolute sample correlation 

Figure 1: Distributions of the maximum absolute sample correlation coefficient when n = 60, p = 1000 
(solid curve) and n = 60,p = 5000 (dashed curve), based on 500 simulations. 

challenge with high dimensionality is that important predictors can be highly correlated with some 
unimportant ones, which usually increases with dimensionality. The maximum spurious correlation 
also grows with dimensionality. We illustrate this using a simple example. Suppose the predictors 
Xi, - ■ ■ , Xp are independent and follow the standard normal distribution. Then, the design matrix is 
an n X p random matrix, each entry an independent realization from J\f{0, 1). The maximum absolute 
sample correlation coefficient among predictors can be very large. This is indeed against our intuition, as 
the predictors are independent. To show this, we simulated 500 data sets with 7i = 60 and p = 1000 and 
p = 5000, respectively. Figure 1 shows the distributions of the maximum absolute sample correlation. 
The multiple canonical correlation between two groups of predictors (e.g., 2 in one group and 3 in 
another) can even be much larger, as there are already (^) (^3^) = 0{p^) choices of the two groups in 
our example. Hence, sure screening when p is large is very challenging. 

The paper is organized as follows. In the next section we propose a sure screening method Sure 
Independence Screening (SIS) and discuss its rationale as well as its connection with other methods of 
dimensionality reduction. In Section 3 we review several known techniques for model selection in the 
reduced feature space and present two simulations and one real data example to study the performance 
of SIS based model selection methods. In Section 4 we discuss some extensions of SIS and in partic- 
ular, an iterative SIS is proposed and illustrated by thi^ee simulated examples. Section 5 is devoted to 
the asymptotic analysis of SIS, an iteratively thresholded ridge regression screener as well as two SIS 
based model selection methods. Some concluding remarks are given in Section 6. Technical details are 
provided in the Appendix. 



5 



2 Sure Independence Screening 



2.1 A sure screening method: correlation learning 

By sure screening we mean a property that all the important variables survive after applying a variable 
screening procedure with probability tending to one. A dimensionaUty reduction method is desirable if 
it has the sure screening property. Below we introduce a simple sure screening method using compo- 
nentwise regression or equivalently a correlation learning. Throughout the paper we center each input 
variable so that the observed mean is zero, and scale each predictor so that the sample standard deviation 
is one. Let A^* = {l<i<p:/3j/0}be the true sparse model with nonsparsity size s = |A^*|. The 
other p — s variables can also be coiTclated with the response variable via linkage to the predictors con- 
tained in the model. Let u = {uji, • • • , Wp)^ be a vector obtained by the componentwise regression, 
that is. 



where the n x p data matrix X is first standardized columnwise as mentioned before. Hence, uj is 
really a vector of marginal correlations of predictors with the response variable, rescaled by the standard 
deviation of the response. 

For any given 7 G (0, 1), we sort the p componentwise magnitudes of the vector in a decreasing 
order and define a submodel 



where [yn] denotes the integer part of 771. This is a straightforward way to shrink the full model 
{1, • • • ,p} down to a submodel A4^ with size d = [jn] < n. Such a correlation learning ranks the 
importance of features according to their marginal correlation with the response variable and filters out 
those that have weak marginal correlations with the response variable. We call this correlation screening 
method Sure Independence Screening (SIS), since each feature is used independently as a predictor to 
decide how useful it is for predicting the response variable. This concept is broader than the correla- 
tion screening and is applicable to generalized linear models, classification problems under various loss 
functions, and nonparametric learning under sparse additive models. 

The computational cost of correlation learning or SIS is that of multiplying a p x n matrix with 
an n-vector plus getting the largest d components of a p-vector, so SIS has computational complexity 
0{np). 

It is worth to mention that SIS uses only the order of componentwise magnitudes of uj, so it is indeed 
invariant under scaling. Thus the idea of SIS is identical to selecting predictors using their correlations 
with the response. To implement SIS, we note that linear- models with more than n parameters are 
not identifiable with only n data points. Hence, we may choose d = [yn] to be conservative, for 
instance, n — 1 or n/ log n depending on the order of sample size n. Although SIS is proposed to reduce 
dimensionality p from high to below sample size 7i, nothing can stop us applying it with final model size 




(2) 



■M-'y = {1 < i < P : l"^*! is among the first [771] largest of all} 



(3) 
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d > n, say, 7 > 1. It is obvious that larger d means larger probability to include the true model TW* in 
the final model A4^. 

SIS is a hard-thresholding-type method. For orthogonal design matrices, it is well understood. But 
for general design matrices, there is no theoretical support for it, though this kind of idea is frequently 
used in applications. It is important to identify the conditions under which the sure screening property 
holds for SIS, i.e., 

P{M*CM^)^1 asn^cx) (4) 

for some given 7. This question as well as how the sequence 7 = 7n ^ should be chosen will 
be answered by Theorem 1 in Section 5. We would like to point out that the Simple Thresholding 
Algorithm (see, e.g.. Baron et ai, 2005 and Gribonval et al, 2007) that is used in sparse approximation 
or compressed sensing is a one step greedy algorithm and related to SIS. In particular, our asymptotic 
analysis in Section 5 helps to understand the performance of the Simple Thresholding Algorithm. 

2.2 Rationale of correlation learning 

To better understand the rationale of the correlation learning, we now introduce an iteratively thresholded 
ridge regression screener (ITRRS), which is an extension of the dimensionality reduction method SIS. 
But for practical implementation, only the correlation learning is needed. ITRRS also provides a very 
nice technical tool for our understanding of the sure screening property of the correlation screening and 
other methods. 

When there are more predictors than observations, it is well known that the least squares estimator 
/3ls = (X^X) ^ X^y is noisy, where (X^X) ^ denotes the Moore-Penrose generalized inverse of X^X. 
We therefore consider the ridge regression, namely, linear regression with ^2-regularization to reduce 
the variance. Let u)'^ = {u^, ■ ■ ■ , be a p- vector obtained by the ridge regression, that is, 

u^ = {X^X + XIpy^X^y, (5) 

where A > is a regularization parameter. It is obvious that 

^^^3ls asA^O, (6) 

and the scaled ridge regression estimator tends to the componentwise regression estimator: 

Xuj'^ (jj as A ^ 00. (7) 

In view of Q, to make u}^ less noisy we should choose large regularization parameter A to reduce the 
variance in the estimation. Note that the ranking of the absolute components of uj^ is the same as that 
of Ao;^. In light of ([7]) the componentwise regression estimator is a specific case of the ridge regression 
with regularization parameter A = 00, namely, it makes the resulting estimator as less noisy as possible. 

For any given 5 € (0, 1), we sort the p componentwise magnitudes of the vector u'^ in a descending 
order and define a submodel 

-^5,A = ] 1 < ^ !^ P : \ is among the first [5p\ largest of all > . (8) 
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This procedure reduces the model size by a factor of 1 — 5. The idea of ITRRS to be introduced below 
is to perform dimensionality reduction as above successively until the number of remaining variables 
drops to below sample size n. 

It will be shown in Theorem 2 in Section 5 that under some regularity conditions and when the 
tuning parameters A and 6 are chosen appropriately, with overwhelming probability the submodel 
will contain the true model A4 and its size is an order for some 9 > lower than the original one p. 
This property stimulates us to propose ITRRS as follows: 

• First, carry out the procedure in (HJ to the full model {!,••• ,p} and get a submodel Aig^ ^^^^ 
size [5p]; 

• Then, apply a similar procedure to the model M-l ^ and again obtain a submodel M'j ^ C 
with size [6'^p] , and so on; 

• Finally, get a submodel Ais,\ = a ^^^^ ^^^^ ^ ~ i^'^P] < where [6^^^p] > n. 

We would like to point out that the above procedure is different from the threshholded ridge regression, 
as the submodels and estimated parameters change over the course of iterations. The only exception is 
the case that A = oo, in which the rank of variables do not vary with iterations. 

Now we are ready to see that the correlation learning introduced in Section 2.1 is a specific case 
of ITRRS since the componentwise regression is a specific case of the ridge regression with an infinite 
regularization parameter. The ITRRS provides a very nice technical tool for understanding how fast 
the dimension p can grow compared with sample size n and how the final model size d can be chosen 
while the sure screening property still holds for the coiTclation learning. The question of whether ITRRS 
has the sure screening property as well as how the tuning parameters 7 and 5 should be chosen will be 
answered by Theorem 3 in Section 5. 

The number of steps in ITRRS depends on the choice of 5 G (0, 1). We will see in Theorem 3 that 
5 can not be chosen too small which means that there should not be too many iteration steps in ITRRS. 
This is due to the cumulation of the probability errors of missing some important variables over the 
iterations. In particular, the backward stepwise deletion regression which deletes one variable each time 
in ITRRS until the number of remaining variables drops to below sample size might not work in general 
as it requires p — d iterations. When p is of exponential order, even though the probability of mistakenly 
deleting some important predictors in each step of deletion is exponentially small, the cumulative error 
in exponential order of operations may not be negligible. 

2.3 Connections with other dimensionality reduction methods 

As pointed out before, SIS uses the marginal information of correlation to perform dimensionality re- 
duction. The idea of using marginal information to deal with high dimensionality has also appeared 
independently in Huang, Horowitz and Ma (2008) who proposed to use marginal bridge estimators to 
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select variables for sparse high dimensional regression models. We now look at SIS in the context of 
classification, in which the idea of independent screening appears natural and has been widely used. 

The problem of classification can be regarded as a specific case of the regression problem with 
response variable taking discrete values such as ±1. For high dimensional problems Uke tumor classifi- 
cation using gene expression or proteomics data, it is not wise to classify the data using the full feature 
space due to the noise accumulation and interpretability. This is well demonstrated both theoretically 
and numerically in Fan and Fan (2008). In addition, many of the features come into play through linkage 
to the important ones (see, e.g.. Figure 1). Therefore feature selection is important for high dimensional 
classification. How to effectively select important features and how many of them to include are two 
tricky questions to answer. Various feature selection procedures have been proposed in the literature 
to improve the classification power in presence of high dimensionality. For example, Tibshirani et al. 
(2002) introduce the nearest shrunken centroids method, and Fan and Fan (2008) propose the Features 
Annealed Independence Rules (FAIR) procedure. Theoretical justification for these methods are given 
in Fan and Fan (2008). 

SIS can readily be used to reduce the feature space. Now suppose we have rii samples from class 1 
and n2 samples from class —1. Then the componentwise regression estimator Q becomes 

u=^iii- ^ Xj., (9) 

Written more explicitly, the j-th component of the p-vector u! is 

u!j = {niXj^i — n2^j,2)/SD of the j-th feature, 

by recalling that each covariate in ^ has been normalized marginally, where i is the sample average 
of the j-th feature with class label "1" and 2 is the sample average of the j-th feature with class label 
"— 1". When rii = 722, coj is simply a version of the two-sample t-statistic except for a scaling constant. 
In this case, feature selection using SIS is the same as that using the two-sample t-statistics. See Fan 
and Fan (2008) for a theoretical study of sure screening property in this context. 

Two-sample t-statistics are commonly used in feature selection for high dimensional classification 
problems such as in the significance analysis of gene selection in microarray data analysis (see, e.g.. 
Storey and Tibshirani, 2003; Fan and Ren, 2006) as well as in the nearest shrunken centroids method 
of Tibshirani et al. (2002). Therefore SIS is an insightful and natural extension of this widely used 
technique. Although not directly applicable, the sure screening property of SIS in Theorem 1 after some 
adaptation gives theoretical justification for the nearest shrunken centroids method. See Fan and Fan 
(2008) for a sure screening property. 

By using SIS we can single out the important features and thus reduce significantly the feature space 
to a much lower dimensional one. From this point on, many methods such as the linear discrimination 
(LD) rule or the naive Bayes (NB) rule can be applied to conduct the classification in the reduced feature 
space. This idea will be illustrated on a Leukemia data set in Section 3.3.3. 
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3 SIS based model selection techniques 



3.1 Estimation and model selection in the reduced feature space 

As shown later in Theorem 1 in Section 5, with the correlation learning, we can shrink the full model 
{!,••• ,p} straightforward and accurately down to a submodel A4 = A4^ with size d = [yn] = o{n). 
Thus the original problem of estimating the sparse p-vector /3 in ([Hi reduces to estimating a sparse 
(i- vector /3 = (/3i, • • • , (3d)^ based on the now much smaller submodel M, namely, 

y = + £, (10) 

where = (xi, • • • ,x„)^ denotes an n x d submatiix of X obtained by extracting its columns 
corresponding to the indices in M.. Apparently SIS can speed up variable selection dramatically when 
the original dimension p is ulti'a high. 

Now we briefly review several well-developed moderate dimensional techniques that can be applied 
to estimate the d-vector (3 in ([TOl l at the scale of d that is comparable with n. Those methods include 
SCAD in Fan and Li (2001) and Fan and Peng (2004), adaptive Lasso in Zou (2006), the Dantzig selector 
in Candes and Tao (2007), among others. 

3.1.1 Penalized least-squares and SCAD 

Penalization is commonly used in variable selection. Fan and Li (2001, 2006) give a comprehensive 
overview of feature selection and a unified framework based on penalized Ukelihood approach to the 
problem of variable selection. They consider the penalized least squares (PLS) 

^ n d 

i=i j=i 

where /3 = • • • , /S^)"^ G R"^ and p\_^ (•) is a penalty function indexed by a regularization parameter 
Xj . Variation of the regularization parameters across the predictors allows us to incorporate some prior 
information. For example, we may want to keep certain important predictors in the model and choose 
not to penalize their coefficients. The regularization parameters Xj can be chosen, for instance, by 
cross-vaUdation (see, e.g., Breiman, 1996 and Tibshirani, 1996). A unified and effective algorithm for 
optimizing penalized likelihood, called local quadratic approximation (LQA), was proposed in Fan and 
Li (2001) and well studied in Hunter and Li (2005). In particular, LQA can be employed to minimize 
the above PLS. In our implementation, we choose Xj = A and select A by BIC. 

An alternative and effective algorithm to minimize the penalized least-squares problem (fTTI ) is the 
local linear approximation (LLA) proposed by Zou and Li (2008). With the local linear approximation, 
the problem ([TTI ) can be cast as a sequence of penalized Li regression problems so that the LARS 
(Efron, et al. , 2004) or other algorithms can be employed. More explicitly, given the estimate , j = 
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Figure 2: Left panel: The SCAD penalty (solid) and its local linear (dashed) and quadratic (dotted) 
approximations at the point x = 4. Right panel: p^(-) for penalized Li (thin solid), SCAD with A = 1 
(dashed) and A = 1.5 (dotted) and adaptive Lasso (thick solid) with 7 = 0.5. 

1, • • • ,d} at the k-th iteration, instead of minimizing (ITTI ). one minimizes 

i-j:(y.-xf/3)^+j:z.fi/3,i, (12) 

i=i j=i 

which after adding the constant term Yl'j=iPXj{\Pj''^\) is a local linear approximation to £(/3) in ([TT]) . 
where = Ip^.d/^f ^1)1- Problem ([H is a convex problem and can be solved by LARS and other 
algorithm such as those in Friedman et al. (2007) and Meier, van der Geer and Biihlmann (2008). In 
this sense, the penalized least-squares problem ([TT]) can be regarded as a family of weighted penalized 
Li-problem and the function p'xi') dictates the amount of penalty at each location. The emphasis on 
non-concave penalty functions by Fan and Li (2001) is to ensure that penalty decreases to zero as | 
gets large. This reduces unnecessary biases of the penalized likelihood estimator, leading to the oracle 
property in Fan and Li (2001). Figure 2 depicts how the SCAD function is approximated locally by a 
linear or quadratic function and the derivative functions p\{-) for some commonly used penalty func- 
tions. When the initial value /3 = 0, the first step estimator is indeed LASSO so the implementation 
of SCAD can be regarded as an iteratively reweighted penalized Li-estimator with LASSO as an initial 
estimator. See Section 6 for further discussion of the choice of initial values = 1, • • • ,d}. 

The PLS (fTTI ) depends on the choice of penalty function px^ (•). Commonly used penalty functions 
include the £p -penalty, < p < 2, nonnegative garrote in Breiman (1995), and smoothly clipped abso- 
lute deviation (SCAD) penalty, in Fan (1997) and a minimax concave penality (MCP) in Zhang (2007) 
(see below for definition). In particular, the ^i-penalized least squares is called Lasso in Tibshirani 
(1996). In seminal papers, Donoho and Huo (2001) and Donoho and Elad (2003) show that penalized 
^o-solution can be found by penalized ^i-method when the problem is sparse enough, which implies that 
the best subset regression can be found by using the penalized ^1 -regression. Antoniadis and Fan (2001) 
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propose the PLS for wavelets denoising with irregular designs. Fan and Li (2001) advocate penalty 
functions with three properties: sparsity, unbiasedness, and continuity. More details on characteriza- 
tion of these three properties can be found in Fan and Li (2001) and Antoniadis and Fan (2001). For 
penalty functions, they showed that singularity at the origin is a necessary condition to generate spar- 
sity and nonconvexity is required to reduce the estimation bias. It is well known that £p-penalty with 
< p < 1 does not satisfy the continuity condition, £p -penalty with p > 1 does not satisfy the sparsity 
condition, and £i -penalty (Lasso) possesses the sparsity and continuity, but generates estimation bias, as 
demonstrated in Fan and Li (2001), Zou (2006), and Meinshausen (2007). 

Fan (1997) proposes a continuously differentiable penalty function called the smoothly clipped ab- 
solute deviation (SCAD) penalty, which is defined by 

p'^m) = A |/ < A) + i^^^M±/ > A)| for some a > 2. (13) 

Fan and Li (2001) suggest using a = 3.7. This function has similar feature to the penalty function 
A \(3\ / (1 + advocated in Nikolova (2000). The MCP in Zhang (2007) translates the flat part of the 
derivative of the SCAD to the origin and is given by 

p',(|/3|) = (aA-|/3|)+/a, 

which minimizes the maximum of the concavity. The SCAD penalty and MCP satisfy the above three 
conditions simultaneously. We will show in Theorem 5 in Section 5 that SIS followed by the SCAD 
enjoys the oracle properties. 

3.1.2 Adaptive Lasso 

The Lasso in Tibshirani (1996) has been widely used due to its convexity. It however generates estima- 
tion bias. This problem was pointed out in Fan and Li (2001) and formally shown in Zou (2006) even in 
a finite parameter setting. To overcome this bias problem, Zou (2006) proposes an adaptive Lasso and 
Meinshausen (2007) proposes a relaxed Lasso. 

The idea in Zou (2006) is to use an adaptively weighted £i penalty in the PLS (fTTI ). Specifically, he 
introduced the following penalization term 

d 

i=i 

where A > is a regularization parameter and uj = {toi, • • • , w^)^ is a known weight vector. He further 
suggested using the weight vector a) = 1/|/3|'^, where 7 > 0, the power is understood componentwise, 
and /3 is a root-n consistent estimator. In view of ([T2l ). the adaptive Lasso is really the implementation of 
PLS ([TTI ) with = using LLA. Its connections with the family of non-concave penalized 

least-squares is apparently from ([T2l ) and Figure 2. 

The case of 7 = 1 is closely related to the nonnegative garrote in Breiman (1995). Zou (2006) also 
showed that the adaptive Lasso can be solved by the LARS algorithm, which was proposed in Efron, 
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Hastie, Johnstone and Tibshirani (2004). Using the same finite parameter setup as that in Knight and 
Fu (2000), Zou (2006) establishes that the adaptive Lasso has the oracle properties as long as the tuning 

7 — 1 

parameter is chosen in a way such that \/y/n — > and An^~ ^ oo as n ^ oo. 
3.1.3 Dantzig selector 

The Dantzig selector was proposed in Candes and Tao (2007) to recover a sparse high dimensional 
parameter vector in the linear model. Adapted to the setting in (ITOl ). it is the solution to the following 
^i-regularization problem 

mmJICIIi subject to II (X7n)^r||^ < Add, (14) 

where A^^ > is a tuning parameter, r = y — X^vi^ is an n-vector of the residuals, and || • ||i and || • ||oo 
denote the l\ and l^o norms, respectively. They pointed out that the above convex optimization problem 
can easily be recast as a linear program: 

d 

min ^ Ui subject to — u < C < u and — A^crl < (J^mV ~ ^mO < A^al, 

i=l 

where the optimization variables are u = (ui, • • • , u^)^ and C, G R*^, and 1 is a d- vector of ones. 

We will show in Theorem 4 in Section 5 that an application of SIS followed by the Dantzig selector 
can achieve the ideal risk up to a factor of log d with d < n, rather than the original log p. In particular, 
if dimension p is growing exponentially fast, i.e., p = exp(0(n^)) for some > 0, then a direct 
application of the Dantzig selector results in a loss of a factor O(n^) which could be too lai^ge to be 
acceptable. On the other hand, with the dimensionality first reduced by SIS the loss is now merely of a 
factor log d, which is less than log n. 



^^^^H SCAD ^^^^H 




^^^^H DS ^^^^H 








SCAD 

B AdaLasso 


DS 





d' d p 

Figure 3: Methods of model selection with ultra high dimensionality. 



3.2 SIS based model selection methods 

For the problem of ultra-high dimensional variable selection, we propose first to apply a sure screening 
method such as SIS to reduce dimensionality from p to a relatively large scale d, say, below sample size 
n. Then we use a lower dimensional model selection method such as the SCAD, Dantzig selector. Lasso, 
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or adaptive Lasso. We call SIS followed by the SCAD and Dantzig selector SIS-SCAD and SIS-DS, 
respectively for short in the paper. In some situations, we may want to further reduce the model size 
down to d' < d using a method such as the Dantzig selector along with the hard thresholding or the 
Lasso with a suitable tuning, and finally choose a model with a more refined method such as the SCAD 
or adaptive Lasso. In the paper these two methods will be referred to as SIS-DS-SCAD and SIS-DS- 
AdaLasso, respectively for simplicity. Figure 3 shows a schematic diagram of these approaches. 

The idea of SIS makes it feasible to do model selection with ultra high dimensionality and speeds up 
variable selection drastically. It also makes the model selection problem efficient and modular. SIS can 
be used in conjunction with any model selection technique including the Bayesian methods (see, e.g., 
George and McCuUoch, 1997) and Lasso. We did not include SIS-Lasso for numerical studies due to 
the approximate equivalence between Dantzig selector and Lasso (Bickel, Ritov and Tsybakov, 2007; 
Meinshausen, Rocha and Yu, 2007). 

3.3 Numerical studies 

To study the performance of SIS based model selection methods proposed above, we now present two 
simulations and one real data example. 

3.3.1 Simulation I: "independent" features 

For the first simulation, we used the linear model ([T|l with i.i.d. standard Gaussian predictors and Gaus- 
sian noise with standard deviation a = 1.5. We considered two such models with (n,p) = (200, 1000) 
and (800, 20000), respectively. The sizes s of the true models, i.e., the numbers of nonzero coefficients, 
were chosen to be 8 and 18, respectively, and the nonzero components of the p- vectors (3 were randomly 
chosen as follows. We set a = Alogn/ ^/n and 51ogn/y^, respectively, and picked nonzero coeffi- 
cients of the form (—1)" (a + \z\) for each model, where u was drawn from a Bernoulli distribution 
with parameter 0.4 and z was drawn from the standard Gaussian distribution. In particular, the ^2-norms 
WPW of the two simulated models are 6.795 and 8.908, respectively. For each model we simulated 200 
data sets. Even with i.i.d. standard Gaussian predictors, the above settings are nontrivial since there is 
nonnegligible sample correlation among the predictors, which reflects the difficulty of high dimensional 
variable selection. As an evidence, we report in Figure 4 the distributions of the maximum absolute 
sample correlation when n = 200 and p = 1000 and 5000, respectively. It reveals significant sample 
correlation among the predictors. The multiple canonical correlation between two groups of predictors 
can be much larger. 

To estimate the sparse p-vectors /3, we employed six methods: the Dantzig selector (DS) using a 
primal-dual algorithm. Lasso using the LARS algorithm, SIS-SCAD, SIS-DS, SIS-DS-SCAD, and SIS- 
DS-AdaLasso (see Figure 3). For SIS-SCAD and SIS-DS, we chose d = [n/ log n] and for the last two 
methods, we chose d = n — I and d' = [n/ log n] and in the middle step the Dantzig selector was used 
to further reduce the model size from d to d' by choosing variables with the d' lai^gest componentwise 
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0.46 



Maximum absolute sample correlation 



Figure 4: Distributions of the maximum absolute sample correlation when n = 200, p = 1000 (solid 
curve) and n = 200, p = 5000 (dashed curve). 



(a) Histogram, n=200, p =1000 



(b) Histogram, n=800, p =20000 
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Figure 5: (a) Distribution of the minimum number of selected variables required to include the true 
model by using SIS when n = 200, p = 1000 in simulation I. (b) The same plot when n = 800, p = 
20000. 



magnitudes of the estimated d-vector (see Figure 3). 

The simulation results are summarized in Figure 5 and Table 1. Figure 5, produced based on 500 
simulations, depicts the distribution of the minimum number of selected variables, i.e., the selected 
model size, that is required to include all variables in the true model by using SIS. It shows clearly that 
in both settings it is safe to shrink the full model down to a submodel of size [n/ log n] with SIS, which 
is consistent with the sure screening property of SIS shown in Theorem 1 in Section 5. For example, for 
the case of n = 200 andp = 1000, reducing the model size to 50 includes the variables in the true model 
with high probability, and for the case of n = 800 and p = 20000, it is safe to reduce the dimension to 
about 500. For each of the above six methods, we report in Table 1 the median of the selected model 
sizes and median of the estimation errors ||/3 — /3|| in £2 -norm. Four entries of Table 1 are missing due 
to limited computing power and software used. In comparison, SIS reduces the computational burden 
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Table 1 : Results of simulation I 



Medians of the selected model sizes (upper entry) 
and the estimation errors (lower entry) 



p 


DS 


Lasso 


SIS-SCAD 


SIS-DS 


SIS-DS-SCAD 


SIS-DS-AdaLasso 


1000 


10^ 


62.5 


15 


37 


27 


34 




1.381 


0.895 


0.374 


0.795 


0.614 


1.269 


20000 






37 


119 


60.5 


99 



0.288 0.732 0.372 1.014 



(a) Histogram, n=200, p^=1000, s^=5 
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Figure 6: (a) Distribution of the minimum number of selected variables required to include the true 
model by using SIS when n = 200, p = 1000, s = 5 in simulation II. (b) The same plot when n = 
200, p = 1000, s = 8. (c) The same plot when n = 800, p = 20000. 



significantly. 

From Table 1 we see that the Dantzig selector gives nonsparse solutions and the Lasso using the 
cross-validation for selecting its tuning parameter produces large models. This can be due to the fact 
that the biases in Lasso require a small bandwidth in cross-validation, whereas a small bandwidth results 
in lack of sparsistency, using the terminology of Ravikumar et al. (2007). This has also been observed 
and demonstrated in the work by Lam and Fan (2007) in the context of estimating spai^se covariance or 
precision matrices. We should point out here that a variation of the Dantzig selector, the Gauss-Dantzig 
selector in Candes and Tao (2007), should yield much smaller models, but for simplicity we did not 
include it in our simulation. Among all methods, SIS -SCAD performs the best and generates much 
smaller and more accurate models. It is clear to see that SCAD gives more accurate estimates than the 
adaptive Lasso in view of the estimation errors. Also, SIS followed by the Dantzig selector improves the 
estimation accuracy over using the Dantzig selector alone, which is in line with our theoretical result. 

3.3.2 Simulation II: "dependent" features 

For the second simulation, we used similar models to those in simulation I except that the predic- 
tors are now correlated with each other. We considered three models with (n,p, s) = (200, 1000, 5), 
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Table 2: Results of simulation II 



Medians of the selected model sizes (upper entry) 
and the estimation errors (lower entry) 



p 


DS 


Lasso 


SIS-SCAD 


SIS-DS 


SIS-DS-SCAD 


SIS-DS-AdaLasso 


1000 


10^ 


91 


21 


56 


27 


52 


(s = 5) 


1.256 


1.257 


0.331 


0.727 


0.476 


1.204 




10^ 


74 


18 


56 


31.5 


51 


(s = 8) 


1.465 


1.257 


0.458 


1.014 


0.787 


1.824 


20000 






36 


119 


54 


86 



0.367 0.986 0.743 1.762 



(200, 1000, 8), and (800, 20000, 14), respectively, where s denotes the size of the true model, i.e., the 
number of nonzero coefficients. The three p-vectors (3 were generated in the same way as in simulation 
I. We set (cr, a) = (1, 2 log n/ y/n), (1.5, 4 \ogn/ y/n), and (2, 4 log n/-y/n), respectively. In particular, 
the ^2-norms \\(3\\ of the three simulated models are 3.304, 6.795, and 7.257, respectively. To introduce 
correlation between predictors, we first used a Matlab function sprandsym to randomly generate an 
s X s symmetric positive definite matrix A with condition number y/n/ log n, and drew samples of s 
predictors Xi, • • • ,Xs from AA(0, A). Then we took Z^+i, • • • , ~ AA(0, Ip^s) and defined the re- 
maining predictors as Xi = Zi + rXi^g, i = s + 1, • • • , 2s and Xi = Zi + {1 — r) Xi,i = 2s + l, - • • ,p 
with r = 1 — Alog n/p, 1 — 5log n/p, and 1 — 51ogn/p, respectively. For each model we simulated 
200 data sets. 

We applied the same six methods as those in simulation I to estimate the sparse p-vectors /3. For 
SIS-SCAD and SIS-DS, we chose d = log n], [|n/ log n], and [n/ log n], respectively, and for the 
last two methods, we chose d = n — 1 and d' = [|?i/logn], [|n/logn], and [n/logn], respectively. 
The simulation results are similarly summarized in Figure 6 (based on 500 simulations) and Table 2. 
Similar conclusions as those from simulation I can be drawn. As in simulation I, we did not include the 
Gauss-Dantzig selector for simplicity. It is interesting to observe that in the first setting here, the Lasso 
gives large models and its estimation errors are noticeable compare to the norm of the true coefficient 
vector (3. 

3.3.3 Leukemia data analysis 

We also applied SIS to select features for the classification of a Leukemia data set. The Leukemia data 
from high-density Affymetrix oligonucleotide an^ays were previously analyzed in Golub et al. (1999) 
and are available at http : / /vjww . broad . mit . edu/cgi-bin/cancer/datasets . cgi. There 
are 7129 genes and 72 samples from two classes: 47 in class ALL (acute lymphocytic leukemia) and 25 
in class AML (acute mylogenous leukemia). Among those 72 samples, 38 (27 in class ALL and 11 in 
class AML) of them were set as the training sample and the remaining 34 (20 in class ALL and 14 in 
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class AML) of them were set to be the test sample. 

We used two methods SIS-SCAD-LD and SIS-SCAD-NB that will be introduced below to cany out 
the classification. For each method, we first applied SIS to select d = [2n / log n] genes with n = 38 
the training sample size chosen above and then used the SCAD to get a family of models indexed by 
the regularization parameter A. Here, we should point out that our classification results ai^e not very 
sensitive to the choice of d as long as it is not too small. There are certainly many ways to tune the 
regularization parameter A. For simplicity, we chose a A that produces a model with size equal to the 
optimal number of features determined by the Features Annealed Independence Rules (FAIR) procedure 
in Fan and Fan (2008). 16 genes were picked up by their approach. Now we selected 16 genes and got 
a linear model with size 16 by using SIS-SCAD. Finally, the SIS-SCAD-LD method directly used the 
above linear discrimination rule to do classification, and the SIS-SCAD-NB method applied the naive 
Bayes (NB) rule to the resulted 16-dimensional feature space. 

The classification results of the SIS-SCAD-LD, SIS-SCAD-NB, and nearest shrunken centroids 
method in Tibshirani et al. (2002) are shown in Table 3. The results of the nearest shrunken centroids 
method were extracted from Tibshirani et al. (2002). The SIS-SCAD-LD and SIS-SCAD-NB both chose 
16 genes and made 1 test error with training errors and 4, respectively, while the nearest shrunken 
centroids method picked up 2 1 genes and made 1 training error and 2 test errors. 

Table 3: Classification enws on the Leukemia data set 



Method 


Training error 


Test error 


Number of genes 


SIS-SCAD-LD 


0/38 


1/34 


16 


SIS-SCAD-NB 


4/38 


1/34 


16 


Nearest shrunken centroids 


1/38 


2/34 


21 



4 Extensions of SIS 

Like modeling building in linear regression, there are many variations in the implementation of corre- 
lation learning. This section discusses some extensions of SIS to enhance its methodological power. In 
particular, an iterative SIS (ISIS) is proposed to overcome some weak points of SIS. The methodological 
power of ISIS is illustrated by three simulated examples. 

4.1 Some extensions of correlation learning 

The key idea of SIS is to apply a single componentwise regression. Three potential issues, however, 
might arise with this approach. First, some unimportant predictors that are highly correlated with the 
important predictors can have higher priority to be selected by SIS than other important predictors that 
are relatively weakly related to the response. Second, an important predictor that is marginally uncor- 
related but jointly con^elated with the response can not be picked by SIS and thus will not enter the 
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estimated model. Third, the issue of collinearity between predictors adds difficulty to the problem of 
variable selection. These three issues will be addressed in the extensions of SIS below, which allow 
us to use more fully the joint information of the covariates rather than just the marginal information in 
variable selection. 

4.1.1 ISIS: An iterative correlation learning 

It will be shown that when the model assumptions are satisfied, which excludes basically the three afore- 
mentioned problems, SIS can accurately reduce the dimensionality from ultra high to a moderate scale, 
say, below sample size. But when those assumptions fail, it could happen that SIS would miss some 
important predictors. To overcome this problem, we propose below an ISIS to enhance the method- 
ological power. It is an iterative applications of the SIS approach to variable selection. The essence is to 
iteratively apply a large-scale variable screening followed by a moderate-scale careful variable selection. 

The ISIS works as follows. In the first step, we select a subset of ki variables Ai = {Xi-^ , Xi^^ } 
using an SIS based model selection method such as the SIS -SCAD or SIS -Lasso. These variables were 
selected, using SCAD or Lasso, based on the joint information of [n/ log n] variables that survive after 
the correlation learning. Then we have an n-vector of the residuals from regressing the response Y over 
Xi-^ , • ■ ■ ) Xi^^ . In the next step, we treat those residuals as the new responses and apply the same method 
as in the previous step to the remaining p — ki variables, which results in a subset of k2 variables A2 = 
{Xj-^ , • ■ ■ ) ■^jk2 }• remark that fitting the residuals from the previous step on {Xi , • • • , Xp} \ Ai 
can significantly weaken the priority of those unimportant variables that are highly correlated with the 
response through their associations with Xi-^ Xi^^ , since the residuals are uncorrected with those 
selected variables in Ai . This helps solving the first issue. It also makes those important predictors that 
are missed in the previous step possible to survive, which addresses the second issue above. In fact, after 
variables in Ai entering into the model, those that are mai^ginally weakly correlated with Y purely due 
to the presence of variables in Ai should now be con^elated with the residuals. We can keep on doing 
this until we get i disjoint subsets ^1, • • • ,A£ whose union A = uf^-^Ai has a size d, which is less than 
n. In practical implementation, we can choose, for example, the largest / such that < n. From the 
selected features in A, we can choose the features using a moderate scale method such as SCAD, Lasso 
or Dantzig. 

For the problem of ultra-high dimensional variable selection, we now have the ISIS based model 
selection methods which are extensions of SIS based model selection methods. Applying a moderate 
dimensional method such as the SCAD, Dantzig selector. Lasso, or adaptive Lasso to A will produce 
a model that is very close to the true sparse model TW*. The idea of ISIS is somewhat related to the 
boosting algorithm (Freund and Schapire, 1997). In particular, if the SIS is used to select only one 
variable at each iteration, i.e., = 1, the ISIS is equivalent to a form of matching pursuit or a greedy 
algorithm for variable selection (Barron, et al, 2008). 
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4.1.2 Grouping and transformation of the input variables 

Grouping the input variables is often used in various problems. For instance, we can divide the pool 
of p variables into disjoint groups each with 5 variables. The idea of variable screening via SIS can be 
applied to select a small number of groups. In this way there is less chance of missing the important 
variables by taking advantage of the joint information among the predictors. Therefore a more reliable 
model can be constructed. 

A notorious difficulty of variable selection lies in the coUinearity between the covariates. Effective 
ways to rule out those unimportant variables that are highly correlated with the important ones are being 
sought after. A good idea is to transform the input variables. Two possible ways stand out in this regard. 
One is subject related transformation and the other is statistical transformation. 

Subject related transformation is a useful tool. In some cases, a simple linear transformation of the 
input variables can help weaken con^elation among the covariates. For example, in somatotype studies 
the common sense tells us that predictors such as the weights wi, W2 and at 2, 9 and 18 yeai^s are 
positively correlated. We could directly use wi , W2 and ws as the input variables in a linear regression 
model, but a better way of model selection in this case is to use less correlated predictors such as 
{wi,W2 — wi,ws — W2)^, which is a linear transformation of {wi,W2,ws)'^ that specifies the changes 
of the weights instead of the weights themselves. Another important example is the financial time series 
such as the prices of the stocks or interest rates. Differencing can significantly weaken the correlation 
among those variables. 

Methods of statistical transformation include an application of a clustering algorithm such as the 
hierarchical clustering or /c-mean algorithm using the correlation metrics to first group variables into 
highly correlated groups and then apply the sparse principal components analysis (PCA) to construct 
weakly correlated predictors. Now those weakly correlated predictors from each group can be regarded 
as the new covariates and an SIS based model selection method can be employed to select them. 

The statistical techniques we introduced above can help identify the important features and thus 
improve the effectiveness of the vanilla SIS based model selection strategy. Introduction of nonlinear- 
terms and transformation of variables can also be used to reduced the modeling biases of linear model. 
Ravrkumar et al. (2007) introduced sparse additive models (SpAM) to deal with nonlinear feature se- 
lection. 

4.2 Numerical evidence 

To study the performance of the ISIS proposed above, we now present thr^ee simulated examples. The 
aim is to examine the extent to which ISIS can improve SIS in the situation where the conditions of 
SIS fail. We evaluate the methods by counting the frequencies that the selected models include all the 
variables in the true model, namely the ability of correctly screening unimportant variables. 
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4.2.1 Simulated example I 

For the first simulated example, we used a linear model 

y = 5Xi + 5X2 + 5X3 + £, 

where Xi, • • • , Xp are p predictors and e ~ A^(0, 1) is a noise that is independent of the predictors. In 
the simulation, a sample of {Xi, • • • , Xp) with size n was drawn from a multivariate normal distribution 
A^(0, S) whose covariance matrix S = {aij)pxp has entries an = 1, i = I, ■ ■ ■ ,p and aij = p,i ^ j. 
We considered 20 such models characterized by (p, n, p) with p = 100, 1000, n = 20, 50, 70, and 
p = 0, 0.1, 0.5, 0.9, respectively, and for each model we simulated 200 data sets. 

For each model, we applied SIS and the ISIS to select n variables and tested their accuracy of 
including the true model {Xi , X2, X3}. For the ISIS, the SIS-SCAD with d = [n/ log n] was used at 
each step and we kept on collecting variables in those disjoint Aj's until we got n variables (if there 
were more variables than needed in the final step, we only included those with the largest absolute 
coefficients). In Table 4, we report the percentages of SIS, Lasso and ISIS that include the true model. 
All of these three methods select n — 1-variables, in order to make fair comparisons. It is clear that the 
collinearity (large value of p) and high-dimensionality deteriorate the performance of SIS and Lasso, 
and Lasso outperforms SIS somewhat. However, when the sample size is 50 or more, the difference in 
performance is very small, but SIS has much less computational cost. On the other hand, ISIS improves 
dramatically the performance of this simple SIS and Lasso. Indeed, in this simulation, ISIS always picks 
all true variables. It can even have much less computational cost than Lasso when Lasso is used in the 
implementation of ISIS. 

4.2.2 Simulated example II 

For the second simulated example, we used the same setup as in example I except that p was fixed to be 
0.5 for simplicity. In addition, we added a fourth variable to the model and the linear model is now 

Y = 5Xi + 5X2 + 5X3 - 15^X4 + e, 

where X^ ~ -^(0, 1) and has correlation ^ with all the other p — 1 variables. The way X4 was 
introduced is to make it uncorrected with the response Y. Therefore, the SIS can not pick up the true 
model except by chance. 

Again we simulated 200 data sets for each model. In Table 5, we report the percentages of SIS, 
Lasso and ISIS that include the true model of four variables. In this simulation example, SIS performs 
somewhat better than Lasso in variable screening, and ISIS outperforms significantly the simple SIS and 
Lasso. In this simulation it always picks all true variables. This demonstrates that ISIS can effectively 
handle the second problem mentioned at the beginning of Section 4. 1 . 
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Table 4: Results of simulated example I: Accuracy of SIS, Lasso and ISIS 
in including the true model {Xi, X2, X-^} 



p 


n 




p = Q 


p = 0.1 


p = 0.5 


p = 0.9 






SIS 


.755 


.855 


.690 


.670 




20 


Lasso 


.970 


.990 


.985 


.870 


100 




ISIS 


1 


1 


1 


1 






SIS 


1 


1 


1 


1 




50 


Lasso 


1 


1 


1 


1 






ISIS 


1 


1 


1 


1 






SIS 


.205 


.255 


.145 


.085 




20 


Lasso 


.340 


.555 


.556 


.220 






ISIS 


1 


I 


I 


I 






SIS 


.990 


.960 


.870 


.860 


1000 


50 


Lasso 


1 


1 


1 


I 






ISIS 


1 


1 


1 


1 






SIS 


1 


.995 


.97 


.97 




70 


Lasso 


1 


1 


1 


1 






ISIS 


1 


1 


1 


1 



Table 5: Results of simulated example II: Accuracy of SIS, Lasso and ISIS 
in including the true model {Xi,X2,X^,Xi\ 



p 


/j = 0.5 


71. = 20 


n = 50 


n = 70 




SIS 


.025 


.490 


.740 


100 


Lasso 


.000 


.360 


.915 




ISIS 


1 


1 


1 




SIS 


.000 


.000 


.000 


1000 


Lasso 


.000 


.000 


.000 




ISIS 


1 


1 


1 
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4.2.3 Simulated example III 

For the third simulated example, we used the same setup as in example II except that we added a fifth 
variable to the model and the linear model is now 

Y = 5Xi + 5X2 + 5X3 - 15^X4 + X5 + e, 

where X5 ~ A^(0, 1) and is uncorrelated with all the other p — I variables. Again X4 is uncorrelated 
with the response Y. The way X^ was introduced is to make it have a very small correlation with the 
response and in fact the variable X^ has the same proportion of contribution to the response as the noise 
e does. For this particular example, X5 has weaker marginal correlation with Y than Xq, • • • , Xp and 
hence has a lower priority to be selected by SIS. 

For each model we simulated 200 data sets. In Table 6, we report the accuracy in percentage of SIS, 
Lasso and ISIS in including the true model. It is clear to see that the ISIS can improve significantly over 
the simple SIS and Lasso and always picks all true variables. This shows again that the ISIS is able to 
pick up two difficult variables and X^, which addresses simultaneously the second and third problem 
at the beginning of Section 4. 

Table 6: Results of simulated example III: Accuracy of SIS, Lasso and ISIS 
in including the true model {Xi,X2, X3, X4, X^} 



p 




n = 20 


n = 50 


n = 70 




SIS 


.000 


.285 


.645 


100 


Lasso 


.000 


.310 


.890 




ISIS 


1 


1 


1 




SIS 


.000 


.000 


.000 


1000 


Lasso 


.000 


.000 


.000 




ISIS 


1 


1 


1 



4.2.4 Simulations I and II in Section 3.3 revisited 

Now let us go back to the two simulation studies presented in Section 3.3. For each of them, we applied 
the technique of ISIS with SCAD and d = [n/ log n] to select q = [71./ log n] variables. After that, we 
estimated the g-vector /3 by using SCAD. This method is referred to as ISIS-SCAD. We report in Table 7 
the median of the selected model sizes and median of the estimation errors ||/3 — /3|| in ^2 -norm. We can 
see clearly that ISIS improves over the simple SIS. The improvements are more drastic for simulation II 
in which covariates are more correlated and the variable selections are more challenging. 
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Table 7: Simulations I and II in Section 3.3 revisited: Medians of the selected 
model sizes (upper entry) and the estimation errors (lower entry) 



Simulation I 
p ISIS-SCAD 

1000 13 
0.329 



20000 



Simulation II 

ISIS-SCAD 

(s = 5) 11 
0.223 

(s = 8) 13.5 



0.366 

31 27 
0.246 0.315 



5 Asymptotic analysis 

We introduce an asymptotic framework below and present the sure screening property for both SIS and 
ITRRS as well as the consistency of the SIS based model selection methods SIS-DS and SIS-SCAD. 

5.1 Assumptions 

Recall from © that Y = Ya=i Pi^i + ^- Throughout the paper we let Al* = {1 < i < p : /3i / 0} be 
the true sparse model with nonsparsity size s = and define 

z = E-^/^x and Z = X'S''^/^, (15) 

where x = {Xi, • • • , Xp)"^ and S = cov (x). Clearly, the n rows of the transformed design matrix Z 
are i.i.d. copies of z which now has covariance matrix Ip. For simplicity, all the predictors Xi, - ■ ■ , Xp 
are assumed to be standardized to have mean and standard deviation 1. Note that the design matrix X 
can be factored into ZX)^/^. Below we will make assumptions on Z and S separately. 

We denote by Amax (•) and Amin (•) the lai^gest and smallest eigenvalues of a matrix, respectively. For 
Z, we are concerned with a concentration property of its extreme singular values as follows: 
Concentration Property: The random matrix Z is said to have the concentration property if there exist 
some c, ci > 1 and Ci > such that the following deviation inequality 

P (A,„ax(p~'ZZ^) > ci and A™n(p"'ZZ^) < 1/ci) < e'^^" (16) 

holds for any n x p submatrix Z of Z with cn < p < p. We will call it Property C for short. Property C 
amounts to a distributional constraint on z. Intuitively, it means that with large probability the n nonzero 
singular values of the n x p matrix Z are of the same order, which is reasonable since p~^ZZ will 
approach as ^ oo: the larger the p, the closer to In- It relies on the random matrix theory (RMT) to 
derive the deviation inequality in ([T6l ). In particular. Property C holds when x has a p-variate Gaussian 
distribution (see Appendix A.7). We conjecture that it should be shared by a wide class of spherically 
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symmetric distributions. For studies on the extreme eigenvalues and limiting spectral distributions, see, 
e.g., Silverstein (1985), Bai and Yin (1993), Bai (1999), Johnstone (2001), and Ledoux (2001, 2005). 

Some of the assumptions below are purely technical and only serve to provide theoretical under- 
standing of the newly proposed methodology. We have no intent to make our assumptions the weakest 
possible. 

Condition 1. p > n and logp = 0{n^) for some ^ G (0, 1 — 2k), where k is given by Condition 3. 

Condition 2. z has a spherically symmetric distribution and Property C. Also, e ~ AA(0, cr^) for 
some a > 0. 

Condition 3. var {Y) = 0(1) and for some k > and C2, C3 > 0, 

min > — and min |cov(/3-" Y,Xi)\ > C3. 

As seen later, k controls the rate of probability error in recovering the true sparse model. Although 
h = minjg^, |cov(/?r^y, is assumed here to be bounded away from zero, our asymptotic study 
applies as well to the case where b tends to zero as n ^ 00. In particular, when the variables in A^* are 
uncorrected, 6=1. This condition rules out the situation in which an important variable is marginally 
uncorrected with Y, but jointly correlated with Y . 

Condition 4. There exist some r > and C4 > such that 

Amax (S) < C4n^. 

This condition rules out the case of strong collinearity. 

The largest eigenvalue of the population covaiiance matrix XI is allowed to diverge as n grows. 
When there are many predictors, it is often the case that their covariance matrix is block diagonal or 
nearly block diagonal under a suitable permutation of the variables. Therefore A^ax {^) usually does 
not grow too fast with n. In addition. Condition 4 holds for the covariance matrix of a stationary time 
series (see Bickel and Levina, 2004, 2008). See also Grenander and Szego (1984) for more details on 
the characterization of extreme eigenvalues of the covariance matrix of a stationary process in terms of 
its spectral density. 

5.2 Sure screening property 

Analyzing the p-vector in ([J]) when p > n is essentially difficult. The approach we took is to first 
study the specific case with S = /p and then relate the general case to the specific case. 

Theorem 1. (Accuracy of SIS). Under Conditions 1^, if 2k, + t < 1 then there exists some 6 < 
1 — 2k — T such that when 7 ~ cn~^ with c > 0, we have for some C > 0, 
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We should point out here that s < [jn] is impUed by our assumptions as demonstrated in the 
technical proof. The above theorem shows that SIS has the sure screening property and can reduce from 
exponentially growing dimension p down to a relatively large scale d = [yn] = 0{n^^^) < n for some 
9 > 0, where the reduced model A4 = A4j still contains all the variables in the true model with an 
overwhelming probability. In particular, we can choose the submodel size d to be n — 1 or n/ log n for 
SIS if Conditions 1 -4 are satisfied. 

Another interpretation of Theorem 1 is that it requires the model size d = [jn] = n^' with 6* > 
2k + r in order to have the sure screening property. The weaker the signal, the larger the k and hence 
the larger the required model size. Similarly, the more severe the coUinearity, the lai^ger the r and the 
larger the required model size. In this sense, the restriction that 2k + t < 1 is not needed, but k < 1/2 
is needed since we can not detect signals that of smaller order than root-n consistent. In the former case, 
there is no guarantee that 6* can be taken to be smaller than one. 

The proof of Theorem 1 depends on the iterative application of the following theorem, which demon- 
strates the accuracy of each step of ITRRS. We first describe the result of the first step of ITRRS. It shows 
that as long as the ridge parameter A is large enough and the percentage of remaining variables 6 is lai^ge 
enough, the sure screening property is ensured with overwhelming probability. 

Theorem 2. (Asymptotic sure screening). Under Conditions 1^, if 2k + t < 1, X{p^/'^n)^^ oo, 
and — > oo n — > oo, then we have for some C > 0, 

P [M, C Mix) = 1 - 0(exp(-Cn^-271ogn)). 

The above theorem reveals that when the tuning parameters are chosen appropriately, with an over- 
whelming probability the submodel M.\x ^^^^ contain the true model M.^ and its size is an order (for 
some 9 > Q) lower than the original one. This property stimulated us to propose ITRRS. 

Theorem 3. (Accuracy of ITRRS). Let the assumptions of Theorem 2 be satisfied. If drfi oo as 
n — > oo for some 9 < 1 — 2k — r, then successive applications of the procedure in (121) /or k times resuhs 
in a submodel Ms^\ with size d = [5^p\ < n such that for some C > 0, 

P {M, C M5,x) = 1 - 0(exp(-Cni-27 log n)). 

Theorem 3 follows from iterative application of Theorem 2 k times, where k is the first integer such 
that [5^p] < n. This implies that k = 0(logp/logn) = 0{n^). Therefore, the accumulated error 
probability, from the union bound, is still of exponentially small with a possibility of a different constant 
C. 

ITRRS has now been shown to possess the sure screening property. As mentioned before, SIS is a 
specific case of ITRRS with an infinite regularization parameter and hence enjoys also the sure screening 
property. 

Note that the number of steps in ITRRS depends on the choice of 6 G (0, 1). In particular, 6 can 
not be too small, or equivalently, the number of iteration steps in ITRRS can not be too large, due to 
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the accumulation of the probability errors of missing some important variables over the iterations. In 
particular, the stepwise deletion method which deletes one variable each time in ITRRS might not work 
since it requires p — d steps of iterations, which may exceed the error bound in Theorem 2. 

5.3 Consistency of SIS-DS and SIS-SCAD 

To study the property of the Dantzig selector, Candes and Tao (2007) introduce the notion of uniform 
uncertainty principle (UUP) on deterministic design matrices which essentially states that the design 
matrix obeys a "restricted isometry hypothesis." Specifically, let A be an n x d deterministic design 
matrix and for any subset T C {1, • " " ,d}. Denote by At the 7i x |r| submatrix of A obtained by 
extracting its columns coiTcsponding to the indices in T. For any positive integer S < d, the 5-restricted 
isometry constant 6s = 5s (A) of A is defined to be the smallest quantity such that 

(1 - 6s) ||vf < llAyvf < (1 + 6s) ||vf 

holds for all subsets T with |T| < S and v G R'^' . For any pair of positive integers S, S' with S+S' < d, 
the S, 5' -restricted orthogonality constant 9s,s' = (^S,S'{^) of A is defined to be the smallest quantity 
such that 

|(Atv, A^-zv')! < 9s^s' \M\ 

holds for all disjoint subsets T, T' of cardinahties |r| < 5 and \T'\ < S", v G RI^I, and v' G RI^'L 

The following theorem is obtained by the sure screening property of SIS in Theorem 1 along with 
Theorem 1.1 in Candes and Tao (2007), where e ~ AA(0, a^I) for some cr > 0. To avoid the selection 
bias in the prescreening step, we can split the sample into two halves: the first half is used to screen 
variables and the second half is used to construct the Dantzig estimator. The same technique applies to 
SCAD, but we avoid this step of detail for simplicity of presentation. 

Theorem 4. (Consistency of SIS-DS). Assume with large probability, (52s (X^) + ^s,2s(X;vi) < t < 1 
and choose = ^/2\ogd in Tlien with large probability, we have 

<C{\ogd) sa^, 

where C = 32/ (1 — t)^ and s is the number of nozero components of (3. 

This theorem shows that SIS-DS, i.e., SIS followed by the Dantzig selector, can now achieve the 
ideal risk up to a factor of log d with d < n, rather than the original log p. 

Now let us look at SIS-SCAD, that is, SIS followed by the SCAD. For simplicity, a common reg- 
ularization parameter A is used for the SCAD penalty function. Let /3scad = (^Pi^ " " " ^ Pd^ be a 
minimizer of the SCAD-PLS in ([TTI ). The following theorem is obtained by the sure screening property 
of SIS in Theorem 1 along with Theorems 1 and 2 in Fan and Peng (2004). 

Theorem 5. (Oracle properties of SIS-SCAD). If d = o(n^/^) and the assumptions of Theorem 2 in 
Fan and Peng (2004) be satisfied, then, with probability tending to one, the SCAD-PLS estimator /3scad 
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satisfies: (i) Pi = Ofor any i A^*; (ii) the components of (3v,cad perform as well as if the true 

model M.^: were known. 

The SIS-SCAD has been shown to enjoy the oracle properties. 

6 Concluding remarks 

This paper studies the problem of high dimensional variable selection for the linear model. The concept 
of sure screening is introduced and a sure screening method based on correlation learning that we call 
the Sure Independence Screening (SIS) is proposed. The SIS has been shown to be capable of reduc- 
ing from exponentially growing dimensionality to below sample size accurately. It speeds up variable 
selection dramatically and can also improve the estimation accuracy when dimensionality is ultra high. 
SIS combined with well-developed variable selection techniques including the SCAD, Dantzig selector, 
Lasso, and adaptive Lasso provides a powerful tool for high dimensional variable selection. The tuning 
parameter d can be taken as d = [n/ log n] or d = n — \, depending on which model selector is used 
in the second stage. For non-concave penalized least-squares ([T2l ). when one directly applies the LLA 
algorithm to the original problem with d = p, one needs initial values that are not readily available. SIS 
provides a method that makes this feasible by screening many variables and furnishing the correspond- 
ing coefficients with zero. The initial value in (fT2l) can be taken as the OLS estimate if d = [n/ log n] 
and zero [corresponding to w^^^ = p^(0+)] when d = n — 1, which is LASSO. 

Some extensions of SIS have also been discussed. In particular, an iterative SIS (ISIS) is proposed 
to enhance the finite sample performance of SIS, particularly in the situations where the technical con- 
ditions fail. This raises a challenging question: to what extent does ISIS relax the conditions for SIS 
to have the sure screening property? An iteratively thresholded ridge regression screener (ITRRS) has 
been introduced to better understand the rationale of SIS and serves as a technical device for proving 
the sure screening property. As a by-product, it is demonstrated that the stepwise deletion method may 
have no sure screening property when the dimensionality is of an exponential order. This raises another 
interesting question if the sure screening property holds for a greedy algorithm such as the stepwise 
addition or matching pursuit and how large the selected model has to be if it does. 

The paper leaves open the problem of extending the SIS and ISIS introduced for the linear models 
to the family of generalized linear- models (GLM) and other general loss functions such as the hinge loss 
and the loss associated with the support vector machine (SVM). Questions including how to define asso- 
ciated residuals to extend ISIS and whether the sure screening property continues to hold naturally arise. 
The paper focuses only on random designs which commonly appear in statistical problems, whereas 
for many problems in fields such as image analysis and signal processing the design matrices are often 
deterministic. It remains open how to impose a set of conditions that ensure the sure screening property. 
It also remains open if the sure screening property can be extended to the sparse additive model in non- 
parametric learning as studied by Ravikumar et al. (2007). These questions are beyond the scope of the 
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current paper and are interesting topics for future research. 



A Appendix 

Hereafter we use both C and c to denote generic positive constants for notational convenience. 
A.l Proof of Theorem 1 

Motivated by the results in Theorems 2 and 3, the idea is to successively apply dimensionality reduction 
in a way described in (ITtI) below. To enhance the readability, we split the whole proof into two mains 
steps and multiple substeps. 

Step 1. Let 5 € (0, 1). Similarly to ([8]), we define a submodel 

A^;5 = {1 < i < p : \uJi\i'& among the first [5p\ largest of all} . (17) 

We arm to show that if 5 ^ in such a way that ^n^^^"^"^ ^ oo as n ^ oo, we have for some C > 0, 

P (m^ C M]] = 1 - 0(exp(-Cni-2'^/logn)). (18) 



The main idea is to relate the general case to the specific case with S = Ip, which is separately 
studied in Sections A.4-A.6 below. A key ingredient is the representation ( [T9l ) below of the p x p 
random matrix X^X. Throughout, let S = (Z-^Z)^ Z^Z and = (0, • • • , 1, • • • , 0)"^ be a unit vector 
in W with the i-th entry 1 and elsewhere, i = 1, • • • ,p. 

Since X = ZS^s, it follows from that 

X^X = pi:^/'^ij'^diag{ni,--- ,/i„,)US^/^ (19) 

where //i, • • • , fin n eigenvalues of p~^TIZ7 , U = (/„, 0)^^^^ U, and U is uniformly distributed on 
the orthogonal group 0{ji). By ([T]) and Q, we have 

u = X^X/3 + X^£ = I + r/. (20) 

We will study the above two random vectors ^ and rj separately. 
Step 1.1. First, we consider term ^ = (^i, • • • , ^pf" = X^X/3. 
Step 1.1.1. Bounding \\$,\\jrom above. It is obvious that 

diag (/i?, • • • , ) < [A„,ax(p~'ZZ^)] ' /„ 
and USU^ < Amax(5])I„. These and ^ lead to 

m? < p'A„,ax(S) [A^ax(p~'ZZ^)] ' /3^S^/2u^USi/2/3. (21) 
Let Q e 0{p) such that S^/^^ ^ Q^i- Then, it follows from Lemma 1 that 

/3^5]V2u^USi/2^ = '(Q^SQei,ei>i^ 5]V2^ ' (Sei,ei) , 
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where we use the symbol ^= to denote being identical in distribution for brevity. By Condition 3, 
||SV2^||^ = iS^'EjS < var (Y) = 0(1), and thus by Lemma 4, we have for some C > 0, 

P (^/3^i:V2u^usi/2^ > 0(^)) < 0(e-^"). (22) 



Since A^ax C^) = 0{n'^) and P (Amax(p~^ZZ^) > ci) < e"^!" by Conditions 2 and 4, (EB and ([22] 
along with BonfeiToni's inequality yield 



P > 0{n'+^p)j < 0(e"^'^). (23) 

1.1.2. Bounding i G M^^from below. This needs a delicate analysis. Now fix an arbitrary 
i G By (O, we have 

6 = pef Si/^tj'^diag (Ail, • • • , /^n) lJ5]^/2/3. 

Note that ||S^/2ei|| = y^var {Xi) = 1, ||Si/2/3|| = 0(1). By Condition 3, there exists some c > 
such that 

(sV2^,sV2gi)| = lAI |cov {f3i'Y,X,)\ > c/n\ (24) 
Thus, there exists Q £ 0{p) such that S^/^Cj = Qei and 

5]i/2/3 = S^/^i) Qei + 0(l)Qe2. 



Since (//i, • • • , /i„)^ is independent of U by Lemma 1 and the uniform distribution on the orthogonal 
group 0{p) is invariant under itself, it follows that 

(l]i/2/3, 1]i/2e,\ Ri + 0{p)R2 = 6,1 + ^i,2, (25) 



where R = (i?i, i?2, • • • , ^p) = U diag (/zi, • • • , fin) Uci. We will examine the above two terms 
and separately. Clearly, 

Ri > efu'^A„,in(p-iZZ^)/„Uei = \^,{p-^TL^) (Sei,ei) , 

and thus by Condition 2, Lemma 4, and Bonferroni's inequality, we have for some c > and C > 0, 

P(i?i < cn/p) < 0(e-^"). 

This along with (l24l ) gives for some c > 0, 

P(|e.,il <cni-«) <0(e-^"). (26) 

Similai^ly to Step 1.1.1, it can be shown that 

P (||Rf > 0{n/p)) < 0(e"^"). (27) 
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Since (/ii, • • • , is independent of U by Lemma 1, the argument in the proof of Lemma 5 appUes 
to show that the distribution of R = {R2, • • • , Rp)'^ is invariant under the orthogonal group 0{p — 1). 
Then, it follows that R = ||R|| W/||W||, where W = (l^i, • • • , VFp_i)^ ~ J\f{0, Ip-i), independent 
of ||R||. Thus, we have 

ii2 = ||R||VFi/||W||. (28) 

In view of (l27l) . (1281 ). and ^j^2 = 0{pR2), applying the argument in the proof of Lemma 5 gives for some 
00, 

^(|^i,2| >cV^|W^|) <0(e-^"), (29) 

where VF is a J\f{0, 1) -distributed random variable. 

Let Xn = cy/2Cn}~^ / ^/\ogn. Then, by the classical Gaussian tail bound, we have 

P[c^\W\ > < VV^ = 0(exp(-Cni-^71ogn)), 

which along with ([29l)and Bonferroni's inequaUty shows that 

P{\^i,2\ > Xn) <= 0(exp(-Cni-271ogn)). (30) 
Therefore, by BonfeiToni's inequality, combining (l25l) . (l26l ). and (l30l) together gives for some c > 0, 

P < cn^"'') < 0{exp{-Cn^~^''/\ogn)), i G M^. (31) 

Step 1.2. Then, we examine term 77 = (r/i, • • • , rfp) = X e. 
Step 1.2.1. Bounding \\r}\\ from above. Clearly, we have 

XX^ = ZEZ^ < ZAn,ax(S)/pZ^ = pAn,ax(S)A^ax(p-'ZZ^)I„. 

Then, it follows that 

\\r]f = e^XX^e < pA„,ax(S)A„,ax(p~'ZZ^) \\sf . (32) 

From Condition 2, we know that ef/o"^, • • • , e^/cr^ are i.i.d. Xi -distributed random variables. Thus, by 
( |47] ) in Lemma 3, there exist some c > and C > such that 

P (\\sf > cna'^'^ < e^^", 

which along with (l32l ). Conditions 2 and 4, and Bonferroni's inequality yields 

P {Wvf > 0{n^+^p)) < 0(e-^"). (33) 



Step 1.2.2. Bounding \rii\from above. Given that X = X, rj = X'^e ~ Af{0,a'^X'^ X). Hence, 
'>li\x=x ~ -^(0' {r]i\X = X)) with 



var {rii\X = X) = a'^eJX^Xei. (34) 
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Let £ be the event {var (r/i|X) < cn} for some c > 0. Then, using the same argument as that in Step 
1.1.1, we can easily show that for some C > 0, 

< 0(e-^"). (35) 

On the event £, we have 

P {\r]i\ > x\X) < P {^/m\W\ > x) for any x > 0, (36) 
where 14^ is a J\f{0, 1) -distributed random variable. Thus, it follows from (1351 ) and (l36l ) that 

P {\Vi\ >x)< 0(e-^") + P {^/m\W\ > x) . (37) 
Let = V 2cCn^~'^/ Vlog n. Then, invoking the classical Gaussian tail bound again, we have 
P {Vc^\W\ > x'^) = Oiexpi-Cn^-^" / logn)), 
which along with (|37] ) and Condition 1 shows that 

P (max\rji\ > o{n^^^)] < 0{pexp{-Cn^~'^'' / \ogn)) =0{exp{-Cn^~'^^ / logn)). (38) 



Step 1.3. Finally, we combine the results obtained in Steps 1.1 and 1.2 together. By BonfeiToni's 
inequality, it follows from (l20l) . (1231 ). (|3T]) . (1331 ). and (l38l) that for some constants ci, C2, C > 0, 

(39) 



P i min \u>i\ < cin^ or > C2n^^'^p ) < 0(sexp(-Cn^ ^'^/logn)). 



This shows that with overwhelming probability 1 — 0{s exp(— Cn^ ^'*/ log n)), the magnitudes of cuj, 
« G A^*, are uniformly at least of order n^"'^ and more importantly, for some c > 0, 

f 1 Ti^^^ p cp 

#<l<k<p: \ujk\ > min ^ < c— = ,2^_. , (40) 

where #{•} denotes the number of elements in a set. 

Now, we are ready to see from (l40l) that if 5 satisfies 5'n}~'^'^~'^ — > oo as n ^ oo, then ([TSll holds 
for some constant C > larger than that in ( [39l ). 

1 

Step 2. Fix an arbitrary r G (0, 1) and choose a shrinking factor (5 of the form {^)^^^ , for some in- 
teger k >1. We successively perform dimensionaUty reduction until the number of remaining variables 
drops to below sample size n: 

• First, carry out the procedure in (llTl) to the full model M.^^= { 1, • " " j p} ^^d get a submodel M\ 
with size [5p\ ; 

• Then, apply a similar- procedure to the model A^J and again obtain a submodel M.^ d M.\ with 

size [S'^p], and so on; 
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Finally, get a submodel Ais = ■M'^ with size d = [5''p] = [5^n] < n, where [(5*"' ^p] = [6^ ^n] > 



n. 



It is obvious that Ms = A^^, where ^ = 5^ < 1. 

Now fix an arbitrary 9i G (0, 1 — 2k — r) and pick some r < 1 very close to 1 such that Oq = 9i/r < 
1 — 2k — T. We choose a sequence of integers A; > 1 in a way such that 

j^i-2K-r ^ ^ j^eo ^ Q as ?i ^ oo, (41) 

where 5 = (^) Then, applying the above scheme of dimensionality reduction results in a submodel 
Ms = Mry, where 7 = 5*" satisfies 

^^r(i--2K-T) _^ ^ ^jjj ^^01 ^ Q as n ^ oo. (42) 

Before going further, let us make two important observations. First, for any principal submatrix 
of T, corresponding to a subset of variables. Condition 4 ensures that 

Amax(S°) < Amax(S) < C4n". 

Second, by definition. Property C in ([T6l ) holds for any n xp submatrix Z of Z with cn < p < p, where 
c > 1 is some constant. Thus, the probability bound in ( fTSl ) is uniform over dimension p G {cn,p]. 
Therefore, for some C > 0, by (|4TI ) and (fTSl) we have in each step 1 < i < A; of the above dimensionality 
reduction, 

P (m, C Mi\M, C M'f^) = 1 - 0(exp(-Cni-271ogn)), 



'<5 

which along with Bonferroni's inequality gives 

P C M^) = l-0{k exp(-Cni^2K/ log n)). (43) 

It follows from (|4TI ) that A; = 0(Iogp/ log n), which is of order 0{n^ / log n) by Condition 1. Thus, 
a suitable increase of the constant C > in (1431 ) yields 

P C A^^) = 1 - 0(exp(-C7ni-2«/ log n)). 

Finally, in view of (l42l ). the above probability bound holds for any 7 ~ cn~^, with 9 < 1 — 2k — t and 
c > 0. This completes the proof. 

A.2 Proof of Theorem 2 

One observes that ([8]) uses only the order of componentwise magnitudes of u'^, so it is invariant under 
scaling. Therefore, in view of Q we see from Step 1 of the proof of Theorem 1 that Theorem 2 holds 
for sufficiently large regularization parameter A. 

It remains to specify a lower bound on A. Now we rewrite the p-vector Acj'^ as 



Acj^ = u: 



Ip - {Ip + A-^X^X) ' 



u. 
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LetC = (Ci,--- XpY 
that 



and thus 



Ip - {Ip + A-^X^ X) ' a;. It follows easily from X^X = Y}I'^T7TT}I'^ 

An,ax(X^X) < pAn,ax(p~^ZZ^)A,^ax(S), 

IKf < [Amax [Ip - (Ip + A-^X^X)-!)]' 

< [A^ax(A"^X^X)]'||cjf 

< A-y [A,,ax(p"'ZZ^)]' [An,ax(S)]2 ||a;f , 
which along with (l39l ). Conditions 2 and 4, and Bonferroni's inequaUty shows that 

Pencil > 0{\-^n^p^/^)^ < 0{s exp{-Cn^-^'^ / \ogn)). 

Again, by Bonferroni's inequality and (l39l ). any A satisfying A~^n~2~p3/2 _ ^^n be used. 

Note that k + t/2 < 1/2 by assumption. So in particular, we can choose any A satisfying A(p^/^n)~^ — > 
oo as n ^ oo. 



A.3 Proof of Theorem 3 

Theorem 3 is a straightforward corollary to Theorem 2 by the argument in Step 2 of the proof of Theorem 
1. 

Throughout Sections A.4-A.6 below, we assume that p > n and the distribution of z is continuous 
and spherically symmetric, that is, invariant under the orthogonal group 0{p). For brevity, we use ^ (•) 
to denote the probability law or distribution of the random variable indicated. Let S"^~^(r) = {x G R'' : 
||x|| = r} be the centered sphere with radius r in g-dimensional Euclidean space R'^. In particular, 5''"^ 
is referred to as the unit sphere in R'^. 

A.4 The distribution of S = (Z^Z) ^ Z^Z 

It is a classical fact that the orthogonal group 0{p) is compact and admits a probability measure that is 
invariant under the action of itself, say, 

Q-9 = Q9, geO{p),QeO{p). 

This invariant distribution is referred to as the uniform distribution on the orthogonal group 0{p). We 
often encounter projection matrices in multivariate statistical analysis. In fact, the set of all p x p 
projection matrices of rank n can equivalently be regai^ded as the Grassmann manifold „ of all n- 
dimensional subspaces of the Euclidean space R^; throughout, we do not distinguish them and write 

gp,n = {t/^diag {In, 0)U:Ue 0{p)} . 
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It is well known that the Grassmann manifold „ is compact and there is a natural ©(p) -action on 
it, say, 

Q-g = Q^gQ, g ^ Qp,n,Q ^ 0{p). 

Clearly, this group action is transitive, i.e. for any (71,(72 G Qp,n, there exists some Q G 0{p) such 
that Q • gi = g2- Moreover, „ admits a probability measure that is invariant under the C'(p) -action 
defined above. This invariant distribution is referred to as the uniform distribution on the Grassmann 
manifold Gp,n- For more on group action and invariant measures on special manifolds, see Eaton (1989) 
and Chikuse (2003). 

The uniform distribution on the Grassmann manifold is not easy to deal with directly. A useful fact 
is that the uniform distribution on Qp^n is the image measure of the uniform distribution on (D(^p) under 
the mapping 

^ : 0{p) ^ Gp^n, V{U) = C/^diag (/„, 0) f/, G 0{p). 

By the assumption that z has a continuous distribution, we can easily see that with probability one, 
the nxp matrix Z has full rank n. Let ^/JIi, • • • , ^/Jhi be its n singular- values. Then, Z admits a singular 
value decomposition 

Z = VDiU, (44) 

where V G 0{n), U G 0{p), and Di is an n x p diagonal matrix whose diagonal elements are 
a/W' ■ ■ ■ 1 ^/Jhi■, respectively. Thus, 

Z^Z = U^diag (/ii, • • • , 0, • • • , 0) U (45) 

and its Moore-Penrose generalized inverse is 

n 

i=i 

where U"^ = (ui, • • • , Up). Therefore, we have the following decomposition, 

S = (Z^Z)^ Z^Z = U^diag {In, 0) U, U G 0{p). (46) 

From (l44l) . we know that Z = Vdiag [^/JIi, • • • , ^/Jjhl) {In, ^)nxp U> and thus 

(/„, 0)„^p U = diag (1/ • • • , 1/y^) V^Z. 

By the assumption that ^ (z) is invariant under the orthogonal group 0{p), the distribution of Z is also 
invariant under 0{p), i.e., 

ZQ = Z for any Q € 0{p). 

Thus, conditional on V and (^1 , • • • , /U„ )^, the conditional distribution of (/„ , 0),„ U is invariant under 
0{p), which entails that 

(i„,o)„,^ui^(/„,o)„,^u, 

where U is uniformly distributed on the orthogonal group 0{p). In particular, we see that (/ii, • • • , fi) 
is independent of (/„, 0)„xp U. Therefore, these facts along with (1461 ) yield the following lemma. 
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Lemma 1. ^ (^(/n,0)„xpUj = ^ [{ln,0)^^pl]j and{fii,--- , ^Xnf is independent of {I 

where U is uniformly distributed on the orthogonal group 0(j>) and /^i, • • • , /in ci^^ n- eigenvalues of 

TIl7 . Moreover, S is uniformly distributed on the Grassmann manifold Gp,n- 

For simplicity, we do not distinguish U and U in the above singular value decomposition (l44l) . 
A.5 Deviation inequality on (Sei, ei) 

2 

Lemma 2. ^ ((Sei, ei)) = 2^''a — » where Xn Xp-n independent -distributed random 

variables with degrees of freedom n and p — n, respectively, that is, (Sei,ei) has a beta distribution 
with parameters n/2 and [p — n) /2. 

Proof. Lemma 1 gives .if (S) = ^ (U'^'diag (/„, 0) U), where U is uniformly distributed on 0{p). 
Clearly, (Uei) is a random vector on the unit sphere 5^"^. It can be shown that Uei is uniformly 
distributed on the unit sphere S^^^. 

Let W = (VFi, • • • , WpY ~ AA(0, Ip). Then, we have Uei = W/||W|| and 

(Sei,ei) = (Ueif diag(4,0)Uei^ n 



This proves Lemma 2. □ 

Lemmas 3 and 4 below give sharp deviation bounds on the beta-distribution. 
Lemma 3. (Moderate deviation). Let .^i, • • • ,(,n be i.i.d. xf-distributed random variables. Then, 

( i) for any e > 0, we have 

P {n-\ii + --- + in)>l + e)< e-^^", (47) 
where = [e - log(l + e)] /2 > 0. 

(ii) for any e G (0, 1), we have 

P [n~Hii + --- + in)<l-e)< e-^^", (48) 

where = [s - log(l - e)] /2 > 0. 

Proof, (i) Recall that the moment generating function of a xf -distributed random variable ^ is 

M(t) = ^e*« = (1 - 2^)"^/^ (-00,1/2). (49) 

Thus, for any e > and < t < 1/2, by Chebyshev's inequality (see, e.g. van der Vaart and Wellner, 
1996) we have 

P (^^^V^ > 1 + < ^(^Sexp {t + . . . + = exp(-n/,(t)), 
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where (t) = ^ log (1 — 2t) + (1 + e) t. Setting the derivative (t) to zero gives t = 2(i+e) ' where 
/e attains the maximum = [e — log(l + e)] /2, e > 0. Therefore, we have 



P{n-\Ci + --- + U) > 1 + e) <e 



This proves ( |47l ). 

(ii) For any < e < 1 and t > 0, by Chebyshev's inequality and (l49l ). we have 



(n^Ha + • • • + en) < 1 - e) < ^^exp {t{l-^,) + ... + t{l- Cn)} = expi-nge{t)), 
where ge (t) = \ log (1 + 2t) - (1 - e) t. Taking t = e/(2 (1 - e)) yields (08]). □ 

Lemma 4. (Moderate deviation). For any C > 0, e;^/^; constants c\ and ci with < ci < 1 < C2 
such that 

P ( (Sei,ei) < ci- or > cg- ) < 46"^". (50) 

V p pj 

Proof. From Lemma 3, we know that (Sei,ei) =^ ^/t], where ^ is Xn "distributed and is Xp- 
distributed. Note that and are increasing in e and have the same range (0, oo). For any C > 0, 
it follows from the proof of Lemma 3 that there exist ci and C2 with < ci < 1 < C2, such that 
Si_ci = C and Ac^-i = C*- Now define 

^ = ||- < ci or > and 5 = |^ < ci or > . 

Let ci = ci/c2 and C2 = C2/C1. Then, it can easily be shown that 

(Sei,ei) < ci- or > C2- ^ C ^US. (51) 
P pj 

It follows from (l47l) and (l48l) and the choice of ci and C2 above that 

P(^)<2e-^" and P{B)<2e-^P. (52) 



Therefore, by p > n and Bonferroni's inequality, the results follow from (I5TI ) and (I521 ). □ 
A.6 Deviation inequality on (Sei , e2) 

Lemma 5. Let Sei = (Vi, V2, • • • , Vp) . Then, given that the first coordinate V\ = v, the random 
vector (V2, • • • , Vp)'^ is uniformly distributed on the sphere S^~'^{^/v — v"^). Moreover, for any C > 0, 
there exists some c > 1 such that 

P {\V2\ > c^p-^ \W\) < 3e-^", (53) 
where W is an independent M{0, \)-distributed random variable. 
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Proof. In view of (1461) . it follows that 



bVf = efSei = Fi 



where V = (Fi, ■ • • , Fp)'^. For any Q G ©(p - 1), let Q = diag (1, Q) G ©(p). Thus, by Lemma 1, 
we have 

QV (uQ^)^diag(/„,0) (UQ^) Qei 

i^U^diag(/„,0)Ueii^V. 

This shows that given Vi = v, the conditional distribution of (V2,--- ,Vp)'^ is invariant under the 
orthogonal group 0{p — 1). Therefore, given Vi = v, the random vector (V2, • • • , Vp)^ is uniformly 
distributed on the sphere S^~'^{Vv — v"^). 

Let Wi, • • • , VFp_i be i.i.d. AA(0, 1) -distributed random variables, independent of Vi. Conditioning 
on Vi, we have 

V. S , (54) 

^Hrt + . . . + n<-2_j 

Let C > be a constant. From the proof of Lemma 4, we know that there exists some C2 > 1 such that 

P(yi > C2n/p) < 2e~^". (55) 

It follows from (l48l) that there exists some < ci < 1 such that 

P{Wl + --- + Wl_^ <ci{p- 1)) < e-^(P-^) < e"^", (56) 

since p > n. Let c = \l cijcx. Then, by Vi — < V\ and Bonferroni's inequality, ( l53l) follows 
immediately from (|54b-(l56l). □ 

A.7 Verifying Property C for Gaussian distributions 



In this section, we check Property C in (116] ) for Gaussian distributions. Assume x has a p-variate Gaus- 
sian distribution. Then, the n x p design matrix X ^ AA(0, /„ ® XI) and 

i.e., all the entries of Z are i.i.d. J\f{0, 1) random variables, where the symbol (g) denotes the Kronecker 
product of two matrices. We will invoke results in the random matrix theory on exti'eme eigenvalues of 
random matrices in Gaussian ensemble. 

Before proceeding, let us make two simple observations. First, in studying singular- values of Z, the 
role of n and p is symmetric. Second, when p > n,hy letting W ~ AA(0, Imxp), independent of Z, and 

f Z \ 



^(n+m) xp 



w 



then the extreme singular values of Z are sandwiched by those of Z. Therefore, a combination of 
Lemmas 6 and 7 below immediately implies Property C in (fT6l) . 
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Lemma 6. Let p > n and Z ~ AA(0, Inxp)- Then, there exists some C > such that for any eigenvalue 
A o/p~^ZZ^ and any r > 0, 

Moreover, for each X, the same inequality holds for a median of ^/X instead of the mean. 

Proof. See Proposition 3.2 in Ledoux (2005) and note that Gaussian measures satisfy the dimension-free 
concentration inequality (3.6) in Ledoux (2005). □ 

Lemma 7. Let Z ~ AA(0, Inxp)- Ifp/n ^ 7 > 1 n — > oo, then we have 

lim median ( J Amax(f'~"^ZZ^) 1=1 + 7~"^^^ 

and 

' ^ -1/2 



liminf ^ f A„,-„(p-iZZ^)') > 1 



7 

Proof. The first result follows directly from Geman (1980): 

A^ax(p"'ZZ^) ^ (1 + 7-1/2^' as n ^ 00. 
For the smallest eigenvalue, it is well known that (see, e.g., Silverstein, 1985 or Bai, 1999) 

A™n(p"'ZZ^) ^ (1 - 7-1/2) 2 ^ ^ ^_ 

This and Fatou's lemma entails the second result. □ 

References 

[1] Antoniadis, A. and Fan, J. (2001). Regularization of wavelets approximations (with discussion). J. 
Amer Statist. Assoc. 96, 939-967. 

[2] Bai, Z.D. (1999). Methodologies in spectral analysis of large dimensional random matrices, A 
review. Statistica Sinica 9, 611-677. 

[3] Bai, Z.D. and Yin, Y.Q. (1993). Limit of smallest eigenvalue of a large dimensional sample covari- 
ance matrix. Ann. Prob. 21, 1275-1294. 

[4] Baron, D., Wakin, M. B., Duarte, M. F, Sarvotham, S. and Baraniuk, R. G. (2005). Distributed 
compressed sensing. Manuscript. 

[5] Barron, A., Cohen, A., Dahmen, W. and DeVore, R. (2008). Approximation and learning by greedy 
algorithms. The Annals of Statistics, to appear. 



39 



[6] Bickel, P. J. and Levina, E. (2004). Some theory for Fisher's linear discriminant function, "naive 
Bayes", and some alternatives when there are many more variables than observations. Bernoulli 
10, 989-1010. 

[7] Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. 
Statist. , to appear. 

[8] Bickel, P. J., Ritov, Y. and Tsybakov, A. (2007). Simultaneous analysis of Lasso and Dantzig 
selector. Manuscript. 

[9] Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics 37, 
373-384. 

[10] Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Ann. Statist. 24, 
2350-2383. 

[11] Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger 
than n (with discussion). Ann. Statist., 35, 2313-2404. 

[12] Chikuse, Y. (2003). Statistics on Special Manifolds. Lecture Notes in Statistics. Springer- Verlag, 
Berlin. 

[13] Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. 
Aide-Memoire of a Lecture at AMS Conference on Math Challenges of the 21st Century. 

[14] Donoho, D. L. and Elad, M. (2003). Maximal sparsity representation via li minimization. Proc. 
Nat. Aca. Sci. 100, 2197-2202. 

[15] Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE 
Trans. Inform. Theory 47, 2845-2862. 

[16] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. 
Biometrika 81, 425^55. 

[17] Eaton, M. L. (1989). Group Invariance Applications in Statistics. Institute of Mathematical Statis- 
tics, Hayward, Cahfornia. 

[18] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discus- 
sion). Ann. Statist. 32, 407-499. 

[19] Fan, J. (1997). Comments on "Wavelets in statistics: A review," by A. Antoniadis. J. Italian Statist. 
Assoc. 6, 131-138. 

[20] Fan, J. and Fan, Y. (2008). High dimensional classification using features annealed independence 
rules. Ann. Statist. , to appear. 



40 



[21] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle 
properties. /. Amer. Statist. Assoc. 96, 1348-1360. 

[22] Fan, J. and Li, R. (2002). Variable selection for Cox's proportional hazards model and frailty 
model. Ann. Statist. 30, 74-99. 

[23] Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: feature selection in 
knowledge discovery. Proceedings of the International Congress of Mathematicians (M. Sanz- 
Sole, J. Soria, J.L. Varona, J. Verdera, eds.) Vol. Ill, 595-622. 

[24] Fan, J and Ren, Y. (2006). Statistical analysis of DNA microarray data. Clinical Cancer Research 
12, 4469-4473. 

[25] Fan, J. and Peng, H. (2004). Nonconcave penalized likeUhood with diverging number of parame- 
ters. Ann. Statist. 32, 928-961. 

[26] Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools 
(with discussion). Technometrics 35, 109-148. 

[27] Friedman, J., Hastie, T., Hofling, H., and Tibshirani, R. (2007). Pathwise coordinate optimization. 
The Annals of Applied Statistics, 1, 302-332 

[28] Freund, Y., Schapire, R.E. (1997). A decision-theoretic generalization of on-line leai^ning and an 
application to boosting. Jour Comput. Sys. Sci., 55, 119-139. 

[29] Geman, S. (1980). A limit theorem for the norm of random matrices. Ann. Probab. 8, 252-261. 

[30] George, E. I. and McCuUoch, R. E. (1997). Approaches for Bayesian variable selection. Statistica 
Sinica 7, 339-373. 

[31] Golub, T. R., Slonim, D. K., Tamayo, P, Huard, C, Gaasenbeek, M., Mesirov, J. P, Coller, H., Loh, 
M. L., Downing, J. R., Cahgiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular 
classification of cancer: class discovery and class prediction by expression monitoring. Science 
286, 531-537. 

[32] Greenshtein, E. (2006). Best subset selection, persistence in high dimensional statistical learning 
and optimization under li constraint. Ann. Statist. 34, 2367-2386. 

[33] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection 
and the virtue of overparametrization. Bernoulli 10, 971-988. 

[34] Grenander, U. and Szego, G. (1984). Toeplitz Forms and Their Applications. Chelsea, New York. 

[35] Gribonval, R., Mailhe, B., Rauhut, H., Schnass, K. and Vandergheynst, P. (2007). Avarage case 
analysis of multichannel thresholding. In Proc. ICASSP. 



41 



[36] Hall, P., Marron, J. S. and Neeman, A. (2005). Geometric representation of high dimension, low 
sample size data. J. Roy. Statist. Soc. Sen B 67, 427^44. 

[37] Huang, J., Horowitz, J. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse 
high-dimensional regression models. Ann. Statist., to appear. 

[38] Hunter, D. and Li, R. (2005). Variable selection using MM algorithms. Ann. Statist. 33, 1617-1642. 

[39] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components 
analysis. Ann. Statist. 29, 295-327. 

[40] Knight, K. and Fu, W. (2000). Asymptotics for Lasso-type estimators. Ann. Statist. 28, 1356-1378. 

[41] Lam, C. and Fan, J. (2007). Sparsistency and rates of convergence in large covariance matrices 
estimation. Manuscript. 

[42] Ledoux, M. (2001). The Concentration of Measure Phenomenon. Mathematical Surveys and 
Monographs 89, AMS. 

[43] Ledoux, M. (2005). Deviation Inequalities on Largest Eigenvalues. GAFA Seminar Notes, to ap- 
pear. 

[44] Meier, L., van de Geer, S. and Buhlmann, P. (2008). The group Lasso for logistic regression. 
Journal of the Royal Statistical Society, B, 70, 53-71. 

[45] Meinshausen, N. (2007). Relaxed Lasso. Computational Statistics and Data Analysis, 52, 374-393. 

[46] Meinshausen, N. and Buhlmann, P. (2006). High dimensional graphs and variable selection with 
the Lasso. Ann. Statist. 34, 1436-1462. 

[47] Meinshausen, N., Rocha, G. and Yu, B. (2007). Discussion of "The Dantzig selector: statistical 
estimation when p is much larger than n". Ann. Statist., 35, 2373-2384. 

[48] Nikolova, M. (2000). Local strong homogeneity of a regularized estimator. SIAM J. Appl. Math. 
61, 633-658. 

[49] Paul, D., Bair, E., Hastie, T., Tibshirani, R. (2008). "Pre-conditioning" for feature selection and 
regression in high-dimensional problems. Ann. Statist., to appeai\ 

[50] Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2007). Sparse additive models. Manuscript. 

[51] Silverstein, J. W. (1985). The smallest eigenvalue of a lai'ge dimensional Wishart matrix. Ann. 
Prob. 13, 1364-1368. 

[52] Storey, J. D. and Tibshirani R. (2003). Statistical significance for genome-wide studies. Proc. Natl. 
Aca. Sci. 100, 9440-9445. 

42 



[53] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. /. Roy. Statist. Soc. Sen B 
58, 267-288. 

[54] Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types 
by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 99, 6567-6572. 

[55] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. 
Springer- Verlag, New York. 

[56] Zhang, C.-H. (2007). Penalized linear unbiased selection. Manuscript. 

[57] Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high- 
dimensional linear regression. Ann. Statist. , to appear. 

[58] Zhao, R and Yu, B. (2006). On Model Selection Consistency of Lasso. /. Machine Learning Res. 
7, 2541-2567. 

[59] Zou, H. (2006). The adaptive Lasso and its oracle properties. J. Amen Statist. Assoc. 101, 1418- 
1429. 

[60] Zou, H. and Li, R. (2008). One-step spai^se estimates in nonconcave penalized likelihood models 
(with discussion). Ann. Statist., 36, 1509-1566. 



43 



